INTRODUCTION

¶

Mô tả ảnh

Currently, the issue of wasteful advertising spending poses a significant challenge for many businesses. They are investing substantial amounts in advertising without achieving the expected results. It's evident that a primary cause of this problem is misidentifying the target customer base. This results in an inefficient allocation of advertising budgets. Therefore, there's a need for solutions to accurately identify customers with high purchasing potential, optimizing costs for each product.

All customer information contains valuable insights into shopping habits and personal preferences. Customer segmentation projects enable businesses to understand specific characteristics of various customer groups better. This not only aids in crafting smarter marketing strategies but also enhances customer experiences and fosters better interactions with the market.

ABOUT THE DATASET

¶

The dataset contains a description of purchasing activities of a company. It comprises 29 columns detailing customers' preferences, shopping habits, and demographics. The features are categorized into five types.

About customers:

# Attribute Description
1 ID Customer unique identifier code.
2 Year_Birth Customer's birth year.
3 Education Customer's educational level.
4 Marital_Status Customer's marital status.
5 Income Monthly income of the customer.
6 Kidhome Number of children in the customer's family.
7 Teenhome Number of teenagers in the customer's family.
8 Dt_Customer Registration date of the customer with the company.
9 Recency Days since the customer's most recent purchase.
10 Complain 1 if the customer has complained in the past 2 years; 0 if not.

About products:

# Attribute Description
11 MntWines Amount spent on wine in the last 2 years.
12 MntFruits Amount spent on fruits in the last 2 years.
13 MntMeatProducts Amount spent on meat products in the last 2 years.
14 MntFishProducts Amount spent on fish products in the last 2 years.
15 MntSweetProducts Amount spent on sweets in the last 2 years.
16 MntGoldProds Amount spent on gold products in the last 2 years.

About product advertising:

# Attribute Description
17 NumDealsPurchases Number of discounted deals the customer has purchased.
18 AcceptedCmp1 1 if the customer accepted the offer in the first marketing campaign, otherwise 0.
19 AcceptedCmp2 1 if the customer accepted the offer in the second marketing campaign, otherwise 0.
20 AcceptedCmp3 1 if the customer accepted the offer in the third marketing campaign, otherwise 0.
21 AcceptedCmp4 1 if the customer accepted the offer in the fourth marketing campaign, otherwise 0.
22 AcceptedCmp5 1 if the customer accepted the offer in the fifth marketing campaign, otherwise 0.
23 Response 1 if the customer accepted the offer in the last marketing campaign, otherwise 0.

Về nơi mua hàng:

# Attribute Description
24 NumWebPurchases Number of purchases made through the company's website.
25 NumCatalogPurchases Number of purchases made using catalogs.
26 NumStorePurchases Number of purchases made directly at the company's stores.
27 NumWebVisitsMonth Number of visits to the company's website in the past month.

Uncategorized:

# Attribute Description
28 Z_CostContact -
29 Z_Revenue -

=> Exclude Z_CostContact and Z_Revenue as they have the same value across all rows. That's why they won't contribute anything to building the model and predictions.

STEP 1: IMPORTING LIBRARIES

¶

In [632]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from pandas import get_dummies
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, LabelEncoder
from yellowbrick.cluster import KElbowVisualizer
import warnings
warnings.filterwarnings('ignore')

STEP 2: LOADING DATASET

¶

In [633]:
df_initial = pd.read_csv('D:\WORKSPACE\PYTHON\Project\Customer Segmentation in Python\Marketing_Campaign\marketing_campaign.csv', sep='\t')
df_initial
Out[633]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
0 5524 1957 Graduation Single 58138.0 0 0 04-09-2012 58 635 ... 7 0 0 0 0 0 0 3 11 1
1 2174 1954 Graduation Single 46344.0 1 1 08-03-2014 38 11 ... 5 0 0 0 0 0 0 3 11 0
2 4141 1965 Graduation Together 71613.0 0 0 21-08-2013 26 426 ... 4 0 0 0 0 0 0 3 11 0
3 6182 1984 Graduation Together 26646.0 1 0 10-02-2014 26 11 ... 6 0 0 0 0 0 0 3 11 0
4 5324 1981 PhD Married 58293.0 1 0 19-01-2014 94 173 ... 5 0 0 0 0 0 0 3 11 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2235 10870 1967 Graduation Married 61223.0 0 1 13-06-2013 46 709 ... 5 0 0 0 0 0 0 3 11 0
2236 4001 1946 PhD Together 64014.0 2 1 10-06-2014 56 406 ... 7 0 0 0 1 0 0 3 11 0
2237 7270 1981 Graduation Divorced 56981.0 0 0 25-01-2014 91 908 ... 6 0 1 0 0 0 0 3 11 0
2238 8235 1956 Master Together 69245.0 0 1 24-01-2014 8 428 ... 3 0 0 0 0 0 0 3 11 0
2239 9405 1954 PhD Married 52869.0 1 1 15-10-2012 40 84 ... 7 0 0 0 0 0 0 3 11 1

2240 rows × 29 columns

STEP 3: DATA CLEANING

¶

In [634]:
df_initial.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Z_CostContact        2240 non-null   int64  
 27  Z_Revenue            2240 non-null   int64  
 28  Response             2240 non-null   int64  
dtypes: float64(1), int64(25), object(3)
memory usage: 507.6+ KB

From the results above, we can conclude and note the following:

  • There are missing values in the income.
  • Dt_Customer indicates the date the customer joined, and this data has not been analyzed as DateTime.
  • There are some categorical features in our dataframe (due to some features being of dtype: object). Therefore, we will need to encode them into numerical forms later on.

CHECK THE IMPACT LEVEL OF MISSING DATA ON THE DATASET.

In [635]:
feature_cols = df_initial.columns
print(feature_cols)
Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response'],
      dtype='object')
In [636]:
def display_missing(df, feature_cols):
    n_rows = df_initial.shape[0]
    for col in feature_cols:
        missing_count = df[col].isnull().sum()
        if missing_count > 0:
            print(f"{col} has {missing_count*100/n_rows:.2f}% missing values.")
            
display_missing(df_initial, feature_cols)
Income has 1.07% missing values.

It appears that only the Income column currently contains missing values, accounting for 1.07% (equivalent to 24 rows with missing values), which does not significantly affect the dataset. Therefore, these values should be removed from the data.

In [637]:
df_initial.dropna(inplace=True)
print("The total number of data-points after removing the rows with missing values are:", len(df_initial))
The total number of data-points after removing the rows with missing values are: 2216

Reformat the data type for the DT_Customer column.

In [638]:
df_initial['Dt_Customer'] = pd.to_datetime(df_initial['Dt_Customer'], format='%d-%m-%Y')

print("The oldest record on customer's enrollment:", min(df_initial['Dt_Customer']).date())
print("The newest record on customer's enrollment:", max(df_initial['Dt_Customer']).date())
The oldest record on customer's enrollment: 2012-07-30
The newest record on customer's enrollment: 2014-06-29

The results indicate that the oldest registration was on July 30, 2012, and the most recent was on June 29, 2014.

Now we will explore the unique values in the categorical features to have a clearer idea about the data.

In [639]:
print("Total categories in the feature Marital_Status:\n", df_initial["Marital_Status"].value_counts())
print("Total categories in the feature Education:\n", df_initial["Education"].value_counts())
Total categories in the feature Marital_Status:
 Marital_Status
Married     857
Together    573
Single      471
Divorced    232
Widow        76
Alone         3
Absurd        2
YOLO          2
Name: count, dtype: int64
Total categories in the feature Education:
 Education
Graduation    1116
PhD            481
Master         365
2n Cycle       200
Basic           54
Name: count, dtype: int64
  • Attribute Marital_Status: There are 8 distinct values. Upon closer examination, some values can be grouped together, such as: Alone, Single,

Widow, YOLO, Divorced, and Absurb can be represented by the value Alone; Married and Together can be represented by Partner.

  • Attribute Education: There are 5 distinct values. They can be consolidated into new values: Basic representing Basic, Undergraduate

    representing 2nd Cycle, Graduate representing Graduation, and Postgraduate representing Master, PhD.

In [640]:
#Segmenting education levels in three groups
df_initial['Education'] = df_initial['Education'].replace({'Graduation': 'Graduate', 'PhD': 'Postgraduate', 'Master': 'Postgraduate', '2n Cycle': 'Postgraduate', 'Basic': 'Undergraduate'})
df_initial['Marital_Status'] = df_initial['Marital_Status'].replace({'Married': 'Partner', 'Together': 'Partner', 'Single': 'Alone', 'Divorced': 'Alone', 'Widow': 'Alone', 'Absurd': 'Alone', 'YOLO': 'Alone'})

In the next section, we will perform the following steps to design some new features:

Create the Age attribute for customers based on Year_Birth, representing the birth year of the respective individuals up to the closest time point of the year the dataset was collected (2015).

Generate attributes Day, Dayofweek, Month, and Year from the available DT_Customer attribute.

Establish a Total_Children attribute to indicate the total number of children in a household, including children (Kidhome) and teenagers (Teenhome).

Introduce an Is_Parent attribute to display the parent status. If an individual has children (attribute Children has a value of 1 or more), indicating they are a parent, the Is_Parent value will be set to 1; otherwise, it will be set to 0.

Create another attribute, Total_Spent, to indicate the total amount spent by customers across various categories over a two-year period.

In [641]:
#Age of customer today
df_initial['Age'] = 2015 - df_initial['Year_Birth']

# Create a new column from Dt_Customer column
df_initial['Day'] = df_initial['Dt_Customer'].apply(lambda x: x.day)
df_initial['Dayofweek'] = df_initial['Dt_Customer'].apply(lambda x: x.day_name())
df_initial['Month'] = df_initial['Dt_Customer'].apply(lambda x: x.month)
df_initial['Year'] = df_initial['Dt_Customer'].apply(lambda x: x.year)

# Feature for total children in the householde
df_initial['Total_Children'] = df_initial['Kidhome'] + df_initial['Teenhome']

#Feature pertaining parenthood
df_initial['Is_Parent'] = df_initial['Total_Children'].apply(lambda x: 1 if x != 0 else 0)

#Total spendings on various items
df_initial['Total_Spent'] = df_initial['MntWines'] + df_initial['MntFruits'] + df_initial['MntMeatProducts'] + df_initial['MntFishProducts'] + df_initial['MntSweetProducts'] + df_initial['MntGoldProds']

Rename the attributes and remove unnecessary columns.

In [642]:
# Clear description of feature
df_initial.rename(columns={'MntWines':'Wines', 'MntFruits':'Fruits', 'MntMeatProducts':'Meats', 'MntFishProducts':'Fish', 'MntSweetProducts':'Sweets', 'MntGoldProds':'Golds'}, inplace=True)
df_initial.rename(columns={'NumWebPurchases':'Web', 'NumCatalogPurchases':'Catalog', 'NumStorePurchases':'Store'}, inplace=True)

# Delete columns
df_initial.drop(['ID', 'Year_Birth', 'Dt_Customer', 'Z_CostContact', 'Z_Revenue'], axis=1, inplace=True)

We have a table with new attributes.

In [643]:
df = df_initial.copy()
df
Out[643]:
Education Marital_Status Income Kidhome Teenhome Recency Wines Fruits Meats Fish ... Complain Response Age Day Dayofweek Month Year Total_Children Is_Parent Total_Spent
0 Graduate Alone 58138.0 0 0 58 635 88 546 172 ... 0 1 58 4 Tuesday 9 2012 0 0 1617
1 Graduate Alone 46344.0 1 1 38 11 1 6 2 ... 0 0 61 8 Saturday 3 2014 2 1 27
2 Graduate Partner 71613.0 0 0 26 426 49 127 111 ... 0 0 50 21 Wednesday 8 2013 0 0 776
3 Graduate Partner 26646.0 1 0 26 11 4 20 10 ... 0 0 31 10 Monday 2 2014 1 1 53
4 Postgraduate Partner 58293.0 1 0 94 173 43 118 46 ... 0 0 34 19 Sunday 1 2014 1 1 422
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2235 Graduate Partner 61223.0 0 1 46 709 43 182 42 ... 0 0 48 13 Thursday 6 2013 1 1 1341
2236 Postgraduate Partner 64014.0 2 1 56 406 0 30 0 ... 0 0 69 10 Tuesday 6 2014 3 1 444
2237 Graduate Alone 56981.0 0 0 91 908 48 217 32 ... 0 0 34 25 Saturday 1 2014 0 0 1241
2238 Postgraduate Partner 69245.0 0 1 8 428 30 214 80 ... 0 0 59 24 Friday 1 2014 1 1 843
2239 Postgraduate Partner 52869.0 1 1 40 84 3 61 2 ... 0 1 61 15 Monday 10 2012 2 1 172

2216 rows × 32 columns

DATA VISUALIZATION¶

In [644]:
sns.color_palette('copper_r')
Out[644]:

Check for outliers that may affect the subsequent analysis from the dataset.¶

In [645]:
numeric_cols = df.select_dtypes(include=['float','int']).columns

def _plot_numeric_classes(df, col, bins = 10, hist=True, kde=True):
    sns.distplot(df[col],
                bins = bins,
                hist = hist,
                kde = kde)

def _distribution_numberic(df, numeric_cols, row = 6, col = 5, figsize=(40,30), bins = 20):
    print('Number of numberic: ', len(numeric_cols))
    assert row*(col-1)<len(numeric_cols)
    plt.figure(figsize=figsize)
    plt.subplots_adjust(top=None, bottom=None, left=None, right=None,wspace=0.2, hspace=0.5)
    for i in range (1, len(numeric_cols)+1, 1):
        try:
            plt.subplot(row,col,i)
            _plot_numeric_classes(df, numeric_cols[i-1],bins=bins)
            plt.title(numeric_cols[i-1])
        except:
            break
        
_distribution_numberic(df, numeric_cols)
Number of numberic:  29
  • We observe outliers present in three attributes: Age, Income, and Total_Spent. Let's investigate and remove these outliers.
In [646]:
outliers_column = ["Age", "Income", "Total_Spent"]    
def _boxplot_outliers(df, col, color = 'steelblue'):
    for i in range (1, len(col)+1,1):
        plt.subplot(len(col),1,i)
        plt.title(col[i-1])
        sns.boxplot(x=df[col[i-1]], color=color)
        plt.tight_layout()
        plt.show()
        
_boxplot_outliers(df, outliers_column)
In [647]:
print('Income:')
print(df['Income'].sort_values().tail(5))

print('\nAge:')
print(df['Age'].sort_values().tail(5))
Income:
164     157243.0
1300    157733.0
687     160803.0
617     162397.0
2233    666666.0
Name: Income, dtype: float64

Age:
424      74
1950     75
192     115
339     116
239     122
Name: Age, dtype: int64
  • The Age attribute has some outliers with values above 100.
  • The Income attribute has an outlier value of $666,666, whereas the majority of customers have incomes below $200,000.

This significantly impacts subsequent processing. Therefore, these values need to be removed from the dataset.

In [648]:
#Dropping the outliers by setting a cap on Age and Income. 
df = df[(df["Age"] < 80)]
df = df.drop(2233)
print("The total number of data-points after removing the outliers are:", len(df))
The total number of data-points after removing the outliers are: 2212

Overview of the data and correlations between dimensions of data in pairs.¶

In [649]:
# List of columns to be used for the pairplot
data = ['Age', 'Income', 'Total_Spent', 'Recency', 'Is_Parent']
# Create the pairplot and use 'Is_Parent' for coloring points, 'copper_r' palette is used
plot = sns.pairplot(df[data], hue='Is_Parent', palette='copper_r')
# Add a title to the figure
plot.fig.suptitle('Feature Relationship', y=1.05, weight='bold', fontsize=16)
Out[649]:
Text(0.5, 1.05, 'Feature Relationship')
  • We observe that parents tend to spend less money compared to those who are not parents.

  • The spending trend for products is proportional to the customer's income level.

STEP 4: EXPLORATORY DATA ANALYSIS

¶

WHEN DO CUSTOMERS OFTEN REGISTER FOR MEMBERSHIP?¶

In [651]:
fig, axes = plt.subplots(2,2, figsize=(25,8))
axes = axes.flatten()
fig.suptitle("When Did the Customer Enrolled To be a Member", weight='bold', fontsize=16)

times = ['Dayofweek','Day','Month','Year']
def _times_registration(df, times, palette='copper_r'):
    for i in range(len(times)):
        if df[times[i-1]].dtype == 'O':
            sns.countplot(x = df[times[i]], order=list(df[times[i]].unique()), palette=palette, ax=axes[i])
        else:
            sns.countplot(x = df[times[i]], palette=palette, ax=axes[i])
            
_times_registration(df, times)
  • The majority of customers register as members of the company on equivalent days.

  • On the other hand, most customers register on the 12th of each month.

  • The months of March, May, August, and October have an equivalent number of new registrations, with over 200 new members each month. Additionally, the year 2013 contributes the most members simply because it has more data than other years in this dataset.

  • The number of customers registering as members by day of the week is consistent, with an average of 71 new members per day.

  • Customers tend to make purchases more at the beginning and end of the year, possibly because these days often coincide with holidays, leading to increased demand.

  • Due to incomplete data collection for each year (except for 2013), it's not possible to assess the number of customer registrations by year.

DEMOGRAPHICS¶

DISTRIBUTION OF CUSTOMER' AGE¶

In [652]:
# Age
plt.figure(figsize=(20,5))
plt.title('Customers Age Distribution', weight='bold', fontsize=16)
sns.countplot( x = df["Age"], palette='copper_r')
Out[652]:
<Axes: title={'center': 'Customers Age Distribution'}, xlabel='Age', ylabel='count'>
  • Most of the current customers primarily fall into the young customer segment, ranging from those who are preparing to start a family to those who already have families (ages 36 to 50).

  • They are financially independent individuals. This segment also represents the largest revenue source for the business.

THU NHẬP CÁ NHÂN¶

In [653]:
# Income
plt.figure(figsize=(25,5))
sns.displot( x = df["Income"], kde=True, palette = 'copper_r')
plt.title("Customer Income Distribution" , weight='bold', fontsize=16)
Out[653]:
Text(0.5, 1.0, 'Customer Income Distribution')
<Figure size 2500x500 with 0 Axes>
  • The income of the customers ranges from 20,000 USD to 100,000 USD per month.

  • The primary focus is between 30,000 USD and 80,000 USD.

TRÌNH ĐỘ HỌC VẤN¶

In [654]:
# Education
plt.title('Education Levels Among Customers', weight='bold', fontsize=10)
sns.countplot( x=df["Education"], palette='copper_r')
Out[654]:
<Axes: title={'center': 'Education Levels Among Customers'}, xlabel='Education', ylabel='count'>
  • The majority of the company's customers have a university degree or higher.

TÌNH TRẠNG HÔN NHÂN & SỐ LƯỢNG THÀNH VIÊN¶

In [655]:
# Marital Status
status = df['Marital_Status'].value_counts()

Alone = status[['Alone']].sum()
Partner= status[['Partner']].sum()


plt.title('Customer Marital Status', weight='bold', fontsize=10)
# Pie chart
palette = sns.color_palette('copper_r')
plt.pie([Alone, Partner], labels = ['Alone', 'Partner'], colors = palette, autopct='%.0f%%');

SẢN PHẨM¶

SẢN PHẨM VÀ NHÓM TUỔI KHÁCH HÀNG TIÊU THỤ¶

In [656]:
group = pd.cut(df['Age'], [10, 20, 30, 40, 50, 60, 70, 80])
counts = group.value_counts()

plt.figure(figsize=(8, 6))
ax = counts.plot(kind='bar', color='#f4a460')

plt.xlabel('Age Groups', fontsize=12)
plt.ylabel('Counts', fontsize=12)
plt.title('Distributed By Age Group', fontsize=16, weight='bold')

for i, v in enumerate(counts):
    ax.text(i, v + 0.1, str(v), ha='center', va='bottom', fontsize=10)

plt.show()
  • Leading with the customer group in the age range of 40-50, the lowest is in the two age ranges of 10-20 and 70-80.
In [657]:
df2 = df.copy()
df2['Age'] = group

sum_group = df2[['Total_Spent', 'Age']].groupby('Age').sum()
mean_group = df2[['Total_Spent', 'Age']].groupby('Age').mean()

fig, axes = plt.subplots(1,2,figsize=(14,8))
axes = axes.flatten()

sns.barplot(x = sum_group['Total_Spent'], y = sum_group.index, palette='copper_r', ci=None, orient='h', ax=axes[0])
axes[0].set_title('Total Spent on Products\nby Age Groups', weight='bold', fontsize=16)

for i,v in enumerate(sum_group['Total_Spent']):
    if i == 0 or i == 6:
        axes[0].text(v+30000, i, '$ {}'.format(v), horizontalalignment='center', verticalalignment='center', weight='bold', color='black', fontsize=12)
    else:
        axes[0].text(v-40000, i, '$ {}'.format(v), horizontalalignment='center', verticalalignment='center', weight='bold', color='white', fontsize=12)

sns.barplot(x=mean_group['Total_Spent'], y=mean_group.index, palette='copper_r', ci=None, orient='h', ax=axes[1])
axes[1].set_title('Average Spent on Products\nby Age Groups', weight='bold', fontsize=16)

for i,v in enumerate(mean_group['Total_Spent']):
    axes[1].text(v-130, i, '$ {}'.format(round(v,2)), horizontalalignment='center', verticalalignment='center', weight='bold', color='white', fontsize=12)
  • Customers aged between 70-80, despite being a small portion of buyers, are willing to spend a significant amount compared to other groups to purchase various food items such as wine, fruits, meat, and eggs.

  • It's not surprising that the youngest age group, aged 10-20, spends heavily on items like candy.

  • Among the financially independent age groups, there's consistent demand for gold across all, especially within the 60-70 age bracket (though other age groups from 20-60 are not far behind). They tend to accumulate and preserve wealth in the form of gold. They might also be investing in gold for profit.

  • Overall, the revenue generated from sales indicates that a significant portion comes from the sale of wine and meat.

In [658]:
fig, axd = plt.subplot_mosaic([[0,1,2],[3,4,5], [6,6,7], [6,6,7], [6,6,7]],
                              constrained_layout=True, figsize=(18,10))

#fig.suptitle("Customer's Average Spent on Products\nby Age Groups", weight='bold', fontsize=20)

# Bar plot
products = ['Wines','Fruits','Meats','Fish','Sweets','Golds']
def _barplot(df, col):
 for i in range(len(col)):
    sns.barplot(df, x=group, y=col[i], palette='copper_r', ci=None, ax=axd[i])
    axd[i].set_title(col[i], weight='bold') 

_barplot(df, products)
for i in range(6):
    for p in axd[i].patches:
        axd[i].annotate(format(p.get_height(), '.2f'), (p.get_x() + p.get_width() / 2., p.get_height()),
                        ha='center', va='center', xytext=(0, 9), textcoords='offset points', fontsize=10)
# Pie chart
data = df[['Wines', 'Fruits', 'Meats', 'Fish', 'Sweets', 'Golds']].sum().sort_values()

palette = sns.color_palette('copper')
wedges, texts, autotexts = axd[6].pie(x=data, labels=data.index, autopct='%.2f%%', colors=palette, textprops=dict(fontsize=12))
axd[6].set_title("\n\nPercentage of Company's Revenue\nfrom Products", weight='bold', fontsize=20, x=1.2)

for autotext in autotexts:
    autotext.set_color('white')
    autotext.set_weight('bold')

# Text
for i, (name, value) in enumerate(zip(data.index, data)):
    axd[6].text(2.3, 0.5-0.2*i, r"$\bf{" + name + "}$" + "\t:" + " \$" + str(value), fontsize=14)

axd[7].axis('off')
Out[658]:
(0.0, 1.0, 0.0, 1.0)
  • The data distribution is uneven across each group. In total, there are only 21 people in both the youngest and oldest age groups. This is why there's a significant contrast in total expenditures as mentioned above.

  • Due to the unequal number of individuals in each group, we can use the average spending level. In the 70-80 age group, the average spending per purchase is the highest at 1,080.47 USD/product. Similarly, for the 10-20 age group, the average spending per product slightly exceeds that of the 40-50 age group.

In [659]:
fig, axd = plt.subplot_mosaic([[0,1,2], [3,3,4], [3,3,4]],
                               constrained_layout=True, figsize=(18,8))
fig.suptitle("Average Number of Purchases Made\nThrough Different Methods by Age Groups", weight='bold', fontsize=20)

custom_ylim = (0, 8)
plt.setp(axd[0], ylim=custom_ylim)
plt.setp(axd[1], ylim=custom_ylim)

# Bar plot
channel = ['Web', 'Catalog', 'Store']
_barplot(df, channel)

# Pie chart
data = df[['Web', 'Catalog', 'Store']].sum().sort_values()

palette = sns.color_palette('copper')
wedges, texts, autotexts = axd[3].pie(x=data, labels=data.index, autopct='%.2f%%', colors=palette, textprops=dict(fontsize=12));
axd[3].set_title('\n\nPercentage of Purchases Made\nThrough Different Methods', weight='bold', fontsize=20, x=1.35)

for autotext in autotexts:
    autotext.set_color('white')
    autotext.set_weight('bold')

# Text
for i, (name, value) in enumerate(zip(data.index, data)):
    axd[3].text(2.3, 0.3-0.2*i, r"$\bf{" + name + "}$" + "\t:" + str(value) + " times", fontsize=14)

axd[4].axis('off')
Out[659]:
(0.0, 1.0, 0.0, 1.0)
  • Purchases made through the website, catalog, and physical stores are primarily concentrated among individuals aged over 70. Even though there were only 15 people in this age group in the previous survey, it indicates that their purchasing volume is much higher than other age groups.

  • A significant number of orders are executed in physical stores, accounting for 46.20% of the company's total purchases.

PROMOTION¶

SHOPPING FREQUENCY OF AGE GROUPS BASED ON PROMOTION.¶

In [660]:
plt.figure(figsize=(19,7))
plt.subplot(1, 2, 1)
plt.title('Average Number of Purchases Made with a Discount\nby Age Groups', weight='bold', fontsize=12)
ax = sns.barplot(data=df, x=group, y='NumDealsPurchases', hue='Is_Parent', ci=None, palette='copper')

plt.subplot(1, 2, 2)
plt.title('Ratio of Discounted Purchases to Normal Perchases Per Education Level', weight='bold', fontsize=12)

discounted_purchases = df.groupby('Education').NumDealsPurchases.sum()
total_purchases = df.groupby('Education')[['Web', 'Catalog', 'Store']].sum().sum(axis=1)

percentage = round(discounted_purchases * 100 / total_purchases, 2).sort_values(ascending=False)
ax = sns.barplot(y=percentage.index, x=percentage.values,palette='copper')

plt.show()
  • Customers who are married with children are more attracted to promotional programs compared to the other groups.

EVALUATE THE PROMOTIONAL CAMPAIGNS.¶

In [661]:
df.rename(columns = {'Response':'AcceptedCmp6'},inplace=True)
plt.figure(figsize=(10,4))
plt.title('Percentage of Customer Who Accepted the nth Offer', weight ='bold', fontsize = 16)

percent =  df[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp6']].sum()*100/len(df)

ax = percent.plot.bar(color='#c08552')
percent.plot(style='o-', colormap='copper')

plt.setp(ax, ylim=(0, 18))

for i,v in enumerate(percent):
    plt.text(i, v+1, '{:.2f}%'.format(v), horizontalalignment='center', weight='bold', color='Black', fontsize=10)
  • The direct acceptance rate of offers in the first campaign was only 6.42%. The figure was even lower in the second campaign with only 1.36% of customers.

  • The most recent campaign successfully attracted the highest number of customers compared to previous campaigns with a customer rate of 15.05%.

EVALUATE THE CORRELATION BETWEEN VARIABLES.¶

In [662]:
data = df.select_dtypes(include=[np.number]).columns

plt.figure(figsize=(12,10))
plt.title('Feature correlation', weight='bold', fontsize=16, y=1.05)
sns.heatmap(df[data].corr(), cmap='copper')
Out[662]:
<Axes: title={'center': 'Feature correlation'}>
  • The total number of children and the status of being a parent have a low correlation with the customer's total expenditure.

  • As expected, income and all products are the most influential characteristics affecting the total expenditure.

DATA PREPROCESSING

¶

PREPROCESSING INPUT DATA.¶

In [663]:
# Select columns with non-numeric data types
obj_feat = df.select_dtypes(exclude=[np.number]).columns

# Create dummy variables
dummies = get_dummies(df[obj_feat], drop_first=True)

df_final = pd.concat([df, dummies], axis=1)
df_final.drop(obj_feat, axis=1, inplace=True)
df_final.shape
Out[663]:
(2212, 38)
In [664]:
# Normalizes the DataFrame's numerical features. Ensuring that they have a mean close to 0 and a standard deviation close to 1.
scaler = StandardScaler()
scaled = scaler.fit_transform(df_final)

df_final_scaled = pd.DataFrame(scaled, columns=df_final.columns)
df_final_scaled.head()
Out[664]:
Income Kidhome Teenhome Recency Wines Fruits Meats Fish Sweets Golds ... Total_Spent Education_Postgraduate Education_Undergraduate Marital_Status_Partner Dayofweek_Monday Dayofweek_Saturday Dayofweek_Sunday Dayofweek_Thursday Dayofweek_Tuesday Dayofweek_Wednesday
0 0.287105 -0.822754 -0.929699 0.310353 0.977660 1.552041 1.690293 2.453472 1.483713 0.852576 ... 1.676245 -0.944572 -0.158187 -1.349603 -0.423207 -0.412759 -0.396874 -0.401441 2.616099 -0.423207
1 -0.260882 1.040021 0.908097 -0.380813 -0.872618 -0.637461 -0.718230 -0.651004 -0.634019 -0.733642 ... -0.963297 -0.944572 -0.158187 -1.349603 -0.423207 2.422719 -0.396874 -0.401441 -0.382249 -0.423207
2 0.913196 -0.822754 -0.929699 -0.795514 0.357935 0.570540 -0.178542 1.339513 -0.147184 -0.037254 ... 0.280110 -0.944572 -0.158187 0.740959 -0.423207 -0.412759 -0.396874 -0.401441 -0.382249 2.362908
3 -1.176114 1.040021 -0.929699 -0.795514 -0.872618 -0.561961 -0.655787 -0.504911 -0.585335 -0.752987 ... -0.920135 -0.944572 -0.158187 0.740959 2.362908 -0.412759 -0.396874 -0.401441 -0.382249 -0.423207
4 0.294307 1.040021 -0.929699 1.554453 -0.392257 0.419540 -0.218684 0.152508 -0.001133 -0.559545 ... -0.307562 1.058681 -0.158187 0.740959 -0.423207 -0.412759 2.519690 -0.401441 -0.382249 -0.423207

5 rows × 38 columns

In [665]:
# Converts the value of a column to a single integer.
le = LabelEncoder()

for obj in obj_feat[:-1]:
    trans = le.fit_transform(df[obj])
    df[obj] = trans

CLUSTERING¶

The Elbow method is one of the famous techniques to determine the optimal number of clusters in data. We will transform the data, by default computing the SSE (Sum of Squared Errors) or the total squared error (distance) from each point to its assigned centroid using the Euclidean distance. Although SSE will decrease as k increases, ideally, we should have a small number k that still results in a low SSE.

The value of k can vary due to the random initialization of cluster centroids in KMeans Clustering. To make it consistent throughout this notebook, we set the random state to 123.

In [666]:
elbow = KElbowVisualizer(KMeans(random_state=123), k=10)
elbow.fit(df_final_scaled)
elbow.show()
Out[666]:
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
  • We will use k = 5, so we will have 5 clusters in the dataset.
In [667]:
model = KMeans(n_clusters=5, random_state=123)
yhat = model.fit_predict(df_final_scaled)

df['Cluster'] = yhat
df
Out[667]:
Education Marital_Status Income Kidhome Teenhome Recency Wines Fruits Meats Fish ... AcceptedCmp6 Age Day Dayofweek Month Year Total_Children Is_Parent Total_Spent Cluster
0 0 0 58138.0 0 0 58 635 88 546 172 ... 1 58 4 Tuesday 9 2012 0 0 1617 0
1 0 0 46344.0 1 1 38 11 1 6 2 ... 0 61 8 Saturday 3 2014 2 1 27 1
2 0 1 71613.0 0 0 26 426 49 127 111 ... 0 50 21 Wednesday 8 2013 0 0 776 0
3 0 1 26646.0 1 0 26 11 4 20 10 ... 0 31 10 Monday 2 2014 1 1 53 1
4 1 1 58293.0 1 0 94 173 43 118 46 ... 0 34 19 Sunday 1 2014 1 1 422 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2235 0 1 61223.0 0 1 46 709 43 182 42 ... 0 48 13 Thursday 6 2013 1 1 1341 2
2236 1 1 64014.0 2 1 56 406 0 30 0 ... 0 69 10 Tuesday 6 2014 3 1 444 2
2237 0 0 56981.0 0 0 91 908 48 217 32 ... 0 34 25 Saturday 1 2014 0 0 1241 0
2238 1 1 69245.0 0 1 8 428 30 214 80 ... 0 59 24 Friday 1 2014 1 1 843 2
2239 1 1 52869.0 1 1 40 84 3 61 2 ... 1 61 15 Monday 10 2012 2 1 172 1

2212 rows × 33 columns

SIZE OF EACH CLUSTER.¶

In [668]:
sns.countplot(data=df, x='Cluster', palette='copper_r')
Out[668]:
<Axes: xlabel='Cluster', ylabel='count'>

CLUSTER ANALYSIS

¶

INCOME AND TOTAL EXPENDITURE FOR EACH CLUSTER.¶

In [669]:
fig, axd = plt.subplot_mosaic([[0,0],[1,2]], constrained_layout=True, figsize=(14,8))
fig.suptitle('Income vs Total Spent', weight='bold', fontsize=16)

sns.scatterplot(data=df, x='Income', y='Total_Spent', hue='Cluster', palette='copper_r', ax=axd[0])
sns.violinplot(data=df, x='Cluster', y='Total_Spent', palette='copper_r', ax=axd[1])
sns.violinplot(data=df, x='Cluster', y='Income', palette='copper_r', ax=axd[2])
Out[669]:
<Axes: label='2', xlabel='Cluster', ylabel='Income'>

We can see that:

  • Segment 0: High income and expenditure

  • Segment 1: Low income and expenditure

  • Segment 2: Average income and expenditure

  • Segment 3: Lowest income and expenditure

  • Segment 4: Highest income and expenditure.

AGE RANGE AND TOTAL EXPENDITURE.¶

In [670]:
g = sns.FacetGrid(data=df, col='Cluster')
g.map(sns.kdeplot, 'Age', 'Total_Spent', color='#c08552', fill=True)

g.fig.suptitle('Age and Total Spent by Cluster', fontsize=9, weight='bold')
Out[670]:
Text(0.5, 0.98, 'Age and Total Spent by Cluster')

PARENTAL STATUS AND TOTAL EXPENDITURE.¶

In [671]:
g = sns.FacetGrid(data=df, col='Cluster')
g.map(sns.kdeplot, 'Is_Parent', 'Total_Spent', color='#c08552', fill=True)

plt.text(0.6,3900, '0: Non Parent\n1: Parent', weight='bold', fontsize=12)
Out[671]:
Text(0.6, 3900, '0: Non Parent\n1: Parent')

NUMBER OF TEENAGERS RELATED TO TOTAL EXPENDITURE BY FAMILY.¶

In [672]:
g = sns.FacetGrid(data=df, col='Cluster')
g.map(sns.kdeplot, 'Teenhome', 'Total_Spent', color='#c08552', fill=True)
Out[672]:
<seaborn.axisgrid.FacetGrid at 0x25727dc2910>

NUMBER OF CHILDREN RELATED TO TOTAL EXPENDITURE BY FAMILY.¶

In [673]:
g = sns.FacetGrid(data=df, col='Cluster')
g.map(sns.kdeplot, 'Kidhome', 'Total_Spent', color='#c08552', fill=True)
Out[673]:
<seaborn.axisgrid.FacetGrid at 0x257274e6c90>

TOTAL NUMBER OF CHILDREN RELATED TO TOTAL EXPENDITURE BY FAMILY.¶

In [674]:
g = sns.FacetGrid(data=df, col='Cluster')
g.map(sns.kdeplot, 'Total_Children', 'Total_Spent', color='#c08552', fill=True)
Out[674]:
<seaborn.axisgrid.FacetGrid at 0x2572a3a3f10>

MARITAL STATUS WITH TOTAL EXPENDITURE.¶

In [675]:
g = sns.FacetGrid(data=df, col='Cluster')
g.map(sns.kdeplot, 'Marital_Status', 'Total_Spent', color='#c08552', fill=True)

plt.text(0.7,3900, '0: Alone\n1: Partner', weight='bold', fontsize=12)
Out[675]:
Text(0.7, 3900, '0: Alone\n1: Partner')

EDUCATIONAL LEVEL WITH TOTAL EXPENDITURE.¶

In [676]:
g = sns.FacetGrid(data=df, col='Cluster')
g.map(sns.scatterplot, 'Education', 'Total_Spent', color='#c08552')

plt.text(0.6,3400, '0: Graduate\n1: Postgraduate\n2: Undergraduate', weight='bold', fontsize=12)
Out[676]:
Text(0.6, 3400, '0: Graduate\n1: Postgraduate\n2: Undergraduate')

Conclude the characteristics of each cluster:

Cluster 0:

  • Spans across all age groups.
  • Most of them are not parents.
  • For those who are parents, they have at least one teenager.

Cluster 1:

  • Spans across all age groups.
  • Majority are parents.
  • Have a maximum of 3 children in the household.

Cluster 2:

  • Concentrated in middle-aged groups.
  • Most are parents, except for a very few individuals.
  • Have a maximum of 3 children, primarily teenagers.

Cluster 3:

  • Relatively younger.
  • Have a maximum of 2 children, mainly due to fewer children.
  • College students or those not pursuing a bachelor's degree.

Cluster 4:

  • Spans across all age groups.
  • A large number of them are not parents.

AVERAGE NEAREST DAY OF PURCHASE.¶

In [677]:
plt.figure(figsize=(9,4))
plt.title("Average Number of Days Since Customer's Last Purchase\nby Clusters", weight='bold', fontsize=16)
sns.barplot(data=df, x='Cluster', y='Recency', palette='copper_r', ci=None)
Out[677]:
<Axes: title={'center': "Average Number of Days Since Customer's Last Purchase\nby Clusters"}, xlabel='Cluster', ylabel='Recency'>

All segments show a similar average of around 50 days. This indicates that customers do not have a consistent trend of repurchasing the business's products. It reflects some areas where the business may be lacking in retaining its customers, such as:

  • Product quality and customer service.

  • Ineffective marketing strategies.

  • Failure to stimulate customers to shop more frequently.

  • Increasing market competition.

SHOPPING FOR DIFFERENT TYPES OF PRODUCTS BY SEGMENT.¶

In [690]:
fig, axes = plt.subplots(2,3, figsize=(16,8))
fig.suptitle("Customer's Average Spent on Products\nby Clusters", weight='bold', fontsize=20)
axes = axes.flatten()

def _barplot_cluster(df, col, x = 'Cluster', palette='copper_r', ci=None):
    for i in range(len(col)):
        sns.barplot(df, x=x, y=col[i], palette=palette, ci=ci, ax=axes[i])
        axes[i].set_title(col[i], weight='bold') 
        
_barplot_cluster(df, products)
plt.tight_layout()

Cluster 0: ranks at the top in terms of consumption of items.

Cluster 1: consumes the least among segments.

Cluster 2: has average consumption, with gold products being the most commonly consumed.

Cluster 3: has low consumption levels, almost no consumption of wine products.

Cluster 4: has the highest consumption levels across items.

PRIORITY LEVEL FOR USING DIFFERENT SHOPPING METHODS BY EACH CLUSTER¶

In [691]:
fig, axes = plt.subplots(1,3, figsize=(16,5))
fig.suptitle("Average Number of Purchases Made\nThrough Different Methods by Clusters", weight='bold', fontsize=16)
axes = axes.flatten()

custom_ylim = (0, 8)
plt.setp(axes[0], ylim=custom_ylim)
plt.setp(axes[1], ylim=custom_ylim)

# Bar plot
_barplot_cluster(df, channel)

plt.tight_layout()
  • The first Cluster prioritizes shopping in physical stores the most.

  • Cluster 2 prefers purchasing through websites.

  • Both Cluster 2 and 4 prioritize shopping via catalogs.

AVERAGE NUMBER OF PURCHASES THROUGH VARIOUS CAMPAIGN.¶

In [693]:
fig, axes = plt.subplots(2,3, figsize=(16,8))
fig.suptitle("Average Number of Purchases Made\nThrough Different Methods by Clusters", weight='bold', fontsize=20)
axes = axes.flatten()

campaign = ['AcceptedCmp1','AcceptedCmp2','AcceptedCmp3','AcceptedCmp4', 'AcceptedCmp5','AcceptedCmp6']
_barplot_cluster(df, campaign)

plt.tight_layout()
  • Cluster 4 has the strongest response to the campaign.

  • Cluster 3 has the weakest response to the campaign.

  • Program 3 attracts the most participation from various clusters for shopping.

THE AVERAGE NUMBER OF PURCHASES WITH THE DISCOUNT CAMPAIGN.¶

In [681]:
plt.figure(figsize=(9,4))
plt.title('Average Number of Purchases Made with a Discount\nby Clusters', weight='bold', fontsize=16)
sns.barplot(data=df, x='Cluster', y='NumDealsPurchases', ci=None, palette='copper_r')
Out[681]:
<Axes: title={'center': 'Average Number of Purchases Made with a Discount\nby Clusters'}, xlabel='Cluster', ylabel='NumDealsPurchases'>

Cluster 2 is attracted to the most aggressive discount campaign. Evidence shows that the campaign cluster 2 participated in are related to the highest discount levels.

CONCLUSION

¶

CLUSTER 0¶

  • Spans across all age groups.
  • Most of them are not parents.
  • For those who are parents, they have at least one teenager.
  • High income and expenditure levels.
  • They don't show a preference for purchasing products through any specific means (website, catalog, store).

CLUSTER 1¶

  • Spans across all age groups.
  • Majority are parents.
  • Up to 3 children in the household.
  • Lower spending levels.
  • Strongest response to Campaign 6.

CLUSTER 2¶

  • Middle-aged.
  • Mostly parents, with a few exceptions.
  • Up to 3 children, primarily teenagers.
  • The purchase behavior in Campaigns 2 and 4 is slightly higher compared to other segments. A detailed listing of the benefits of these two campaigns and how they influenced purchasing behavior is required. Based on this, a unique attraction method for this segment can be determined.
  • Most attracted to the highest discount programs.
  • Strongest response to Campaign 3.

CLUSTER 3¶

  • Relatively younger.
  • Up to 2 children, mainly due to the number of younger children.
  • University students or those not pursuing a bachelor's degree.

CLUSTER 4¶

  • Spans across all age groups.
  • A significant number are not parents.
  • Highest income and expenditure levels.

GENERAL SUMMARY¶

  • Most of the customers are those with average income levels, lower spending habits, larger family sizes, and are generally not influenced by marketing campaigns.

  • Customers with higher incomes tend to have elevated spending levels and are more responsive to marketing initiatives.

  • Enhance the customer retention programs to ensure repeat purchases.

  • It's essential to reassess the implemented programs and campaigns, evaluating their impact on each segment to discern distinct factors influencing the purchasing behaviors of each segment.